This project aims to detect unusual network activity in a networking system. The original dataset comes from the XYZ Bank's historical log files; a detailed data description can be found below. To distinguish intrusions from benign sessions, we applied four classification methods to our training dataset: Naive Bayes, Random Forest, Boosting, and K-Nearest Neighbours. We then performed cross validation to evaluate the predictive power of each method. To identify different types of intrusions, we conducted K-Means clustering and grouped the intrusions into 3 types based on various combinations of attributes.
In sum, our system provides a holistic approach to detecting intrusions and identifying their types, helping to protect network security and users' privacy.
# Loading Library
library(tidyverse)
library(ggplot2)
library(FactoMineR)
library(knitr)
library(MASS)
library(randomForest)
library(gbm)
library(glmnet)
library(klaR)
library(caret)
library(fastDummies)
library(ROCR)
library(class)
library(ggpubr)
Load the network traffic data.
headers <- read.csv("network_traffic.csv", header = FALSE, nrows = 1, as.is = TRUE)
network <- read.csv("network_traffic.csv", skip = 2, header = FALSE)
colnames(network) <- headers
network <- network %>% filter(is_intrusion %in% c(0, 1))
network$is_intrusion <- as.integer(as.character(network$is_intrusion))
network.copy <- network
str(network)
## 'data.frame': 2999 obs. of 23 variables:
## $ duration : int 0 0 0 0 0 0 0 0 0 0 ...
## $ protocol_type : Factor w/ 3 levels "icmp","tcp","udp": 2 2 2 2 2 2 2 2 2 2 ...
## $ service : Factor w/ 16 levels "auth","domain_u",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ flag : Factor w/ 6 levels "REJ","RSTO","RSTR",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ src_bytes : int 302 339 260 213 308 230 221 329 271 326 ...
## $ dst_bytes : int 896 1588 7334 8679 1658 505 445 2431 688 566 ...
## $ land : int 0 0 0 0 0 0 0 0 0 0 ...
## $ wrong_fragment : int 0 0 0 0 0 0 0 0 0 0 ...
## $ urgent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hot : int 0 0 0 0 0 0 0 0 0 0 ...
## $ num_failed_logins : int 0 0 0 0 0 0 0 0 0 0 ...
## $ logged_in : int 1 1 1 1 1 1 1 1 1 1 ...
## $ num_compromised : int 0 0 0 0 0 0 0 0 0 0 ...
## $ root_shell : int 0 0 0 0 0 0 0 0 0 0 ...
## $ su_attempted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ num_root : int 0 0 0 0 0 0 0 0 0 0 ...
## $ num_file_creations: int 0 0 0 0 0 0 0 0 0 0 ...
## $ num_shells : int 0 0 0 0 0 0 0 0 0 0 ...
## $ num_access_files : int 0 0 0 0 0 0 0 0 0 0 ...
## $ num_outbound_cmds : int 0 0 0 0 0 0 0 0 0 0 ...
## $ is_host_login : int 0 0 0 0 0 0 0 0 0 0 ...
## $ is_guest_login : int 0 0 0 0 0 0 0 0 0 0 ...
## $ is_intrusion : int 0 0 0 0 0 0 0 0 0 0 ...
The network dataset has 22 predictors, which can be divided into discrete and continuous variables. Among the discrete variables, some (`protocol_type`, `service`, `flag`) are multi-categorical with more than 2 levels; the others (`land`, `logged_in`, `root_shell`, `su_attempted`, `is_host_login`, `is_guest_login`) are binary, where 0 means No and 1 means Yes. The dependent variable is `is_intrusion`: a value of 0 means the session is not an intrusion and 1 means it is.
This section presents exploratory data analysis with summaries and plots.
##
## 0 1
## 2699 300
There are 300 intrusion cases in this dataset.
# Intrusion distribution by protocol types
counts <- table(network$is_intrusion, network$protocol_type)
barplot(counts, main="Intrusion Cases by Protocol Types",
xlab="Protocol Types", ylab="Number of Cases",
col=c("darkgreen","red"),
legend = rownames(counts))
There are three protocol types. Intrusion cases appear only in the `tcp` and `udp` protocol types; no `icmp` session is an intrusion.
# Intrusion distribution by flag types
counts <- table(network$is_intrusion, network$flag)
barplot(counts, main="Intrusion Cases by Flag Types",
xlab="Flag Types", ylab="Number of Cases",
col=c("darkgreen","red"))
legend("topleft", legend = c("1","0"), fill = c("red", "darkgreen"))
net.flag <- network %>%
group_by(flag) %>%
summarize(total = n(), intrusion = sum(is_intrusion),
intrusion.rate = sum(is_intrusion) / n()) %>%
arrange(desc(intrusion.rate))
kable(net.flag, digits = c(0, 0, 0, 3))
| flag | total | intrusion | intrusion.rate |
|---|---|---|---|
| RSTR | 65 | 65 | 1.000 |
| S0 | 33 | 33 | 1.000 |
| S3 | 2 | 2 | 1.000 |
| SF | 2744 | 200 | 0.073 |
| REJ | 154 | 0 | 0.000 |
| RSTO | 1 | 0 | 0.000 |
There are six flag types. The bar plot and the table above show that ALL `RSTR`, `S0`, and `S3` flags turn out to be intrusion cases, while NO `REJ` or `RSTO` case is an intrusion.
# Combination of Protocol and Flag
ggplot(data = network) +
geom_count(mapping = aes(x = flag, y = protocol_type, color = as.factor(is_intrusion))) +
labs(color = "Is intrusion or not?") +
scale_color_manual(values = c("darkgreen", "red")) +
ggtitle("Protocol Type versus Flag Type")
In this plot, we plotted `protocol_type` versus `flag`. Red dots represent intrusion cases, while green dots represent normal cases. The size of the dots represents the number of corresponding cases.
The intrusion distribution conforms to our previous discoveries. In addition, we find that all flag types except `SF` belong to the `tcp` protocol type.
# Intrusion distribution by service types
counts <- table(network$is_intrusion, network$service)
barplot(counts, main="Intrusion Cases by Service Types",
xlab="Service Types", ylab="Number of Cases",
col=c("darkgreen","red"),
legend = rownames(counts))
net.service <- network %>%
group_by(service) %>%
summarize(count = n(), count.intrusion = sum(is_intrusion),
intrusion.rate = sum(is_intrusion) / n()) %>%
arrange(desc(intrusion.rate))
kable(net.service, digits = c(0, 0, 0, 3))
| service | count | count.intrusion | intrusion.rate |
|---|---|---|---|
| ftp | 45 | 33 | 0.733 |
| private | 247 | 100 | 0.405 |
| ftp_data | 169 | 67 | 0.396 |
| http | 1911 | 100 | 0.052 |
| auth | 8 | 0 | 0.000 |
| domain_u | 185 | 0 | 0.000 |
| eco_i | 17 | 0 | 0.000 |
| ecr_i | 10 | 0 | 0.000 |
| finger | 14 | 0 | 0.000 |
| ntp_u | 21 | 0 | 0.000 |
| other | 58 | 0 | 0.000 |
| pop_3 | 6 | 0 | 0.000 |
| smtp | 290 | 0 | 0.000 |
| telnet | 3 | 0 | 0.000 |
| time | 2 | 0 | 0.000 |
| urp_i | 13 | 0 | 0.000 |
Only the `ftp`, `private`, `ftp_data`, and `http` service types have intrusion cases.
# Intrusion and logged-in
ggplot(data = network) +
geom_count(mapping = aes(x = as.factor(is_intrusion), y = as.factor(logged_in), color = as.factor(is_intrusion))) +
scale_color_manual(values = c("darkgreen", "red")) +
guides(color = FALSE) +
ggtitle("Intrusion and Logged-In Success") +
xlab("Is intrusion or not?") +
ylab("Logged in or not?")
The size of the circles represents the number of corresponding cases. We find that intrusion cases fail to log in more often.
# Intrusion and duration
ggplot(network, aes(x = as.factor(is_intrusion), y = duration)) +
geom_point() +
ggtitle("Duration Scatter Plot by Intrusion") +
xlab("Is intrusion or not?")
The scatter plot shows that the duration of the connection is shorter on average when it is an intrusion case.
# Intrusion and the `hot` indicator
ggplot(network, aes(x = as.factor(is_intrusion), y = hot)) +
geom_point() +
ggtitle("Scatter Plot of `hot` Indicator") +
xlab("Is intrusion or not?") +
ylab("Number of `hot` Indicators")
Intrusion cases tend to have fewer `hot` indicators.
# Intrusion cases between source and destination
ggplot(network, aes(x = src_bytes, y = dst_bytes, color = as.factor(is_intrusion))) +
geom_point() +
labs(color = "Is intrusion or not?") +
scale_color_manual(values = c("darkgreen", "red")) +
ggtitle("Source Bytes versus Destination Bytes")
From this 2-dimensional scatter plot, we find that intrusion cases have more data bytes from source to destination than from destination to source.
Looking through the whole dataset, we also notice that several columns contain only the value `0`. The following chunk extracts all such columns.
zero.list = c()
for (i in 1:(ncol(network)-1)) {
# Determine if all values are 0
if (nrow(filter(network, network[,i] != 0)) == 0) {
zero.list[i] = i # Store index in the list
}
}
zero.list <- zero.list[!is.na(zero.list)]
# Extract ALL-ZERO columns
colnames(network)[zero.list]
## [1] "land" "wrong_fragment" "urgent"
## [4] "num_failed_logins" "num_outbound_cmds" "is_host_login"
The output shows that the following six columns contain only the value `0`: `land`, `wrong_fragment`, `urgent`, `num_failed_logins`, `num_outbound_cmds`, `is_host_login`.
We have two major objectives for this project: classify whether an activity is an intrusion or not, and cluster different types of intrusions.
We convert categorical variables to factors and continuous variables to numeric. For KNN and K-Means, we create dummy variables for the categorical variables.
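The report performs the dummy-coding step with the fastDummies package loaded above. As a minimal base-R illustration of the same idea (the toy data frame `d` below is hypothetical, not the real dataset):

```r
# One-hot encode a multi-categorical column with base R.
d <- data.frame(protocol_type = factor(c("tcp", "udp", "icmp", "tcp")))
# "- 1" drops the intercept so every factor level gets its own 0/1 column
dummies <- model.matrix(~ protocol_type - 1, data = d)
colnames(dummies)
# -> "protocol_typeicmp" "protocol_typetcp" "protocol_typeudp"
```

Each row of `dummies` has exactly one 1 among the level columns, which is what KNN and K-Means need in place of an unordered factor.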
In our report, we do not use Linear Discriminant Analysis (LDA) or Principal Component Analysis (PCA) for dimension reduction, because our task is to find the rare, abnormal intrusions among all sessions. With dimension reduction, indicative features are very likely to be lost, so we work on all the original features.
We adopt Naive Bayes, Random Forest, Boosting, and KNN models to classify intrusions, and use cross-validation for model selection.
To identify different types of intrusion, we use K-Means method.
A major task of our project is to predict whether a network session is an intrusion given predictors such as its duration and protocol type. Since we have the output, y, in our dataset, we can use supervised learning methods for this classification problem. We will first build Naive Bayes, Random Forest, Boosting and K-Nearest Neighbors models and then use cross validation and ROC curves to evaluate the performance of the different models.
# fit the model
network.nb <- NaiveBayes(is_intrusion ~ ., data = train, usekernel = T)
# predict the class
network.nb.y.hat <- predict(network.nb, network[test,])$class
# draw confusion matrix
network.nb.confusion <- table(network.nb.y.hat, network[test,]$is_intrusion)
network.nb.confusion
##
## network.nb.y.hat 0 1
## 0 549 48
## 1 3 0
# calculate error rate
network.nb.misclassification.rate <- mean(network.nb.y.hat != network[test,]$is_intrusion)
network.nb.misclassification.rate
## [1] 0.085
In our confusion matrix, predictions are in rows and actual values are in columns. As we can see from the matrix, the Naive Bayes model correctly classifies 549 benign sessions but 0 intrusions: it misclassifies all 48 intrusions as benign sessions (and 3 benign sessions as intrusions). So even with a relatively low overall error rate of 8.5%, Naive Bayes works poorly at classifying intrusions.
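A small base-R sketch below recomputes the per-class rates directly from the confusion matrix reported above (the counts are copied from that table), making explicit why the low overall error rate hides a sensitivity of zero:

```r
# Naive Bayes confusion matrix from above: predictions in rows,
# actual values in columns.
cm <- matrix(c(549, 3, 48, 0), nrow = 2,
             dimnames = list(predicted = c("0", "1"),
                             actual    = c("0", "1")))
sensitivity <- cm["1", "1"] / sum(cm[, "1"])   # intrusions caught: 0/48
specificity <- cm["0", "0"] / sum(cm[, "0"])   # benign kept: 549/552
error.rate  <- (cm["1", "0"] + cm["0", "1"]) / sum(cm)
round(c(sensitivity = sensitivity, specificity = specificity,
        error = error.rate), 3)
# sensitivity is 0 even though the overall error is only 0.085
```

Because only 300 of 2999 sessions are intrusions, a model can reach a low error rate while missing every intrusion, which is why per-class rates matter here.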
# fit the model
network.rf <- randomForest(is_intrusion ~., data = train, importance = TRUE)
network.rf
##
## Call:
## randomForest(formula = is_intrusion ~ ., data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 4.96%
## Confusion matrix:
## 0 1 class.error
## 0 2147 0 0.0000000
## 1 119 133 0.4722222
plot(network.rf, main = "Error Rate by No. of Trees")
legend("topright", colnames(network.rf$err.rate), col = 1:3, cex = 0.8, lty = 1:3)
From the figure above, we can see that the error rates of OOB and class `0` (a benign session) are small and quite stable when the number of trees is greater than 30. The error rate of class `1` (an intrusion) is relatively large and becomes stable when the number of trees is greater than about 100.
# find important predictors
important.var <- importance(network.rf)
kable(important.var, digits = c(0,3,3,3,3))
| | 0 | 1 | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|---|---|
| duration | 18 | 22.443 | 21.283 | 48.536 |
| protocol_type | 8 | 4.618 | 8.768 | 7.372 |
| service | 11 | 16.496 | 16.419 | 56.011 |
| flag | 20 | 23.023 | 25.011 | 66.586 |
| src_bytes | 12 | 19.033 | 17.959 | 55.014 |
| dst_bytes | 9 | 11.368 | 13.828 | 26.149 |
| land | 0 | 0.000 | 0.000 | 0.000 |
| wrong_fragment | 0 | 0.000 | 0.000 | 0.000 |
| urgent | 0 | 0.000 | 0.000 | 0.000 |
| hot | 8 | 10.475 | 10.975 | 7.595 |
| num_failed_logins | 0 | 0.000 | 0.000 | 0.000 |
| logged_in | 5 | 12.830 | 7.829 | 12.450 |
| num_compromised | 2 | 1.001 | 2.008 | 0.083 |
| root_shell | 0 | 0.000 | 0.000 | 0.000 |
| su_attempted | 0 | 0.000 | 0.000 | 0.000 |
| num_root | -3 | 0.471 | -1.549 | 0.298 |
| num_file_creations | 0 | 1.001 | 1.001 | 0.022 |
| num_shells | 0 | 0.000 | 0.000 | 0.024 |
| num_access_files | 0 | 0.000 | 0.000 | 0.005 |
| num_outbound_cmds | 0 | 0.000 | 0.000 | 0.000 |
| is_host_login | 0 | 0.000 | 0.000 | 0.000 |
| is_guest_login | 8 | 11.802 | 11.954 | 7.371 |
From the variable importance table above, we can see that `flag`, `service`, `src_bytes`, `duration`, and `dst_bytes` are the variables with the top 5 Mean Decrease Gini values. They are the important variables for predicting an intrusion based on the Random Forest model.
# predict the class
network.rf.y.hat <- predict(network.rf, network[test,])
# draw confusion matrix
network.rf.confusion <- table(network.rf.y.hat, network[test,]$is_intrusion)
network.rf.confusion
##
## network.rf.y.hat 0 1
## 0 552 18
## 1 0 30
# calculate error rate
network.rf.misclassification.rate <- mean(network.rf.y.hat != network[test,]$is_intrusion)
network.rf.misclassification.rate
## [1] 0.03
The predictive power of the Random Forest model on intrusions is substantially higher than that of the Naive Bayes model. It correctly recognizes all benign sessions in the test set (552 of 552) and most intrusions (30 of 48). The overall error rate is 3%.
Next, we select important predictors based on the Random Forest results and color the intrusion data for each pair of important predictors.
important <- data.frame(importance(network.rf)) %>% mutate (name = rownames(importance(network.rf)))
important <- important %>% arrange(-MeanDecreaseAccuracy)
# select first 10 important predictors based on random forest result
important.predictor <- important[1:10, 5]
pairs(train[,match(important.predictor, names(train))], col = c("cornflowerblue", "purple")[train$is_intrusion])
# fit the model
network.boosting <- gbm(is_intrusion ~., data = train,
distribution = "multinomial",
n.trees = 5000, interaction.depth = 4)
network.boosting
## gbm(formula = is_intrusion ~ ., distribution = "multinomial",
## data = train, n.trees = 5000, interaction.depth = 4)
## A gradient boosted model with multinomial loss function.
## 5000 iterations were performed.
## There were 22 predictors of which 10 had non-zero influence.
## var rel.inf
## dst_bytes dst_bytes 5.229369e+01
## service service 3.411658e+01
## flag flag 4.905660e+00
## duration duration 3.135220e+00
## protocol_type protocol_type 3.035872e+00
## src_bytes src_bytes 2.456172e+00
## logged_in logged_in 5.673620e-02
## is_guest_login is_guest_login 6.127751e-05
## num_root num_root 3.313331e-07
## hot hot 8.432320e-10
## land land 0.000000e+00
## wrong_fragment wrong_fragment 0.000000e+00
## urgent urgent 0.000000e+00
## num_failed_logins num_failed_logins 0.000000e+00
## num_compromised num_compromised 0.000000e+00
## root_shell root_shell 0.000000e+00
## su_attempted su_attempted 0.000000e+00
## num_file_creations num_file_creations 0.000000e+00
## num_shells num_shells 0.000000e+00
## num_access_files num_access_files 0.000000e+00
## num_outbound_cmds num_outbound_cmds 0.000000e+00
## is_host_login is_host_login 0.000000e+00
The important variables of the Boosting model are very similar to those of the Random Forest model, but their relative influence is in a different order. `dst_bytes`, `service`, `flag`, and `duration` appear among the 5 most important variables in both models.
# predict probability
network.bst.p <- predict(network.boosting, network[test,],
n.trees = 5000, type = "response")
# predict the class
network.bst.y.hat <- colnames(network.bst.p)[apply(network.bst.p, 1, which.max)]
# draw confusion matrix
network.bst.confusion <- table(network.bst.y.hat, network[test,]$is_intrusion)
network.bst.confusion
##
## network.bst.y.hat 0 1
## 0 532 1
## 1 20 47
# calculate error rate
network.bst.misclassification.rate <- mean(network.bst.y.hat != network[test,]$is_intrusion)
network.bst.misclassification.rate
## [1] 0.035
The Boosting model identifies almost all intrusions correctly (47 of 48), but its accuracy on benign sessions is lower than that of Naive Bayes and Random Forest (532 of 552). Its overall error rate, 3.5%, is very close to that of Random Forest.
In performing KNN, we use `network.dummy` as the dataset, in which multi-level categorical variables (more than 2 levels) are converted into dummy variables.
# Subset training data
train.dummy <- network.dummy[-test,]
# Define the cross validation func for KNN
cv.knn = function(model){
set.seed(2020)
# create 10 folds
folds = createFolds(train.dummy$is_intrusion, k = 10)
cv = lapply(folds, function(x) {
training_fold = train.dummy[-x, ]
test_fold = train.dummy[x, ]
# apply the classifier on the training_fold
classifier = model
y_pred = predict(classifier, newdata = test_fold[-45],type = "class")
cm = table(as.factor(test_fold[, 45]), y_pred)
accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
return(accuracy)
})
# calculate mean for correct rate for 10 folds
cv.average = (cv$Fold01 + cv$Fold02 + cv$Fold03 + cv$Fold04 + cv$Fold05 + cv$Fold06 + cv$Fold07 +
cv$Fold08 + cv$Fold09 + cv$Fold10)/10
return(cv.average)
}
# Apply cross validation to select the best fit of K
network.knn.f <-function(x){
network.knn <- knn3(as.factor(is_intrusion) ~ ., data = train.dummy, k = x)
cv.knn(network.knn)
}
correct.knn <- numeric(20)
for (i in 1:20) {
set.seed(2020)
correct.knn[i] <- network.knn.f(i)
}
# Plot correct rate for different value of K
plot(correct.knn)
points(which.max(correct.knn),correct.knn[which.max(correct.knn)],col="red",cex=2,pch=20)
We find that KNN achieves the highest accuracy with K = 1, so we use K = 1 to develop our model.
# Fit the model with K that has largest correct rate
network.knn <- knn3(as.factor(is_intrusion) ~ ., data = train.dummy, k = which.max(correct.knn))
network.knn.y.hat <- predict(network.knn, network.dummy[test,], type = "class")
# Draw confusion matrix
network.knn.confusion <- table(network.knn.y.hat, network.dummy[test,]$is_intrusion)
network.knn.confusion
##
## network.knn.y.hat 0 1
## 0 532 2
## 1 20 46
# Error rate on the test set
network.knn.misclassification.rate <- mean(network.knn.y.hat != network.dummy[test,]$is_intrusion)
network.knn.misclassification.rate
## [1] 0.03666667
The performance of the KNN model is very close to Boosting. It correctly classifies 532 of 552 benign sessions and 46 of 48 intrusions. Its overall error rate on the test data is 3.67%.
Next, we explore which attributes affect the classification. Based on `test.result`, we found several attributes whose values are all 0, and we identified several representative variables for the plots below.
# Combine predicted values with the original network test data
test.result <- cbind(network[test,], network.knn.y.hat) # no dummy added
# Plot for KNN with 2 selected representative dimensions
plot.1 <- ggplot(data = test.result, aes(x = hot, y = protocol_type, color = network.knn.y.hat)) +
geom_point(position = "jitter") + theme_bw()
plot.2 <- ggplot(data = test.result, aes(x = src_bytes, y = dst_bytes, color = network.knn.y.hat)) +
geom_point() + theme_bw()
plot.3 <- ggplot(data = test.result, aes(x = src_bytes, y = flag, color = network.knn.y.hat)) +
geom_point(position = "jitter") + theme_bw()
plot.4 <- ggplot(data = test.result, aes(x = src_bytes, y = duration, color = network.knn.y.hat)) +
geom_point(position = "jitter") + theme_bw()
plot.5 <- ggplot(data = test.result, aes(x = protocol_type, y = flag, color = network.knn.y.hat)) +
geom_point(position = "jitter") + theme_bw()
plot.6 <- ggplot(data = test.result, aes(x = service, y = protocol_type, color = network.knn.y.hat)) +
geom_point(position = "jitter") + theme_bw() +
theme(axis.text.x = element_text(angle=90, hjust=1, vjust=.5))
plot.7 <- ggplot(data = test.result, aes(x = service, y = flag, color = network.knn.y.hat)) +
geom_point(position = "jitter") + theme_bw() +
theme(axis.text.x = element_text(angle=90, hjust=1, vjust=.5))
plot.8 <- ggplot(data = test.result, aes(x =duration, y = flag, color = network.knn.y.hat)) +
geom_point(position = "jitter") + theme_bw()
# Combine in one plot
ggarrange(plot.1, plot.2, plot.3, plot.4, plot.5, plot.6, plot.7, plot.8,
ncol = 2, nrow = 4)
From these plots, we can find that some variables play an important role in classification. For example, higher `src_bytes` points to intrusion while higher `dst_bytes` points to no intrusion; `flag` values of `S0` and `RSTR` tend to indicate intrusion while `REJ` tends to indicate no intrusion.
For this section, we use cross-validation and ROC curves to compare the different models and select among them.
Apply cross validation to the 4 different models, and select the one with the largest correct rate:
# Similar to the former function, create cross validation function for Naive Bayes, Random Forest and Boosting
# CV function for Naive Bayes
cv.nb = function(model){
set.seed(2020)
folds = createFolds(train$is_intrusion, k = 10)
cv = lapply(folds, function(x) {
training_fold = train[-x, ]
test_fold = train[x, ]
# now apply (train) the classifier on the training_fold
classifier = model
y_pred = predict(classifier, newdata = test_fold[-23])$class
cm = table(test_fold[, 23], y_pred)
accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
return(accuracy)
})
cv.average = (cv$Fold01 + cv$Fold02 + cv$Fold03 + cv$Fold04 + cv$Fold05 + cv$Fold06 + cv$Fold07 +
cv$Fold08 + cv$Fold09 + cv$Fold10)/10
return(cv.average)
}
# CV function for Random Forest
cv.rf = function(model){
set.seed(2020)
folds = createFolds(train$is_intrusion, k = 10)
cv = lapply(folds, function(x) {
training_fold = train[-x, ]
test_fold = train[x, ]
# now apply (train) the classifier on the training_fold
classifier = model
y_pred = predict(classifier, newdata = test_fold[-23])
cm = table(test_fold[, 23], y_pred)
accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
return(accuracy)
})
cv.average = (cv$Fold01 + cv$Fold02 + cv$Fold03 + cv$Fold04 + cv$Fold05 + cv$Fold06 + cv$Fold07 +
cv$Fold08 + cv$Fold09 + cv$Fold10)/10
return(cv.average)
}
# CV function for Boosting
cv.boosting = function(model){
set.seed(2020)
folds = createFolds(train$is_intrusion, k = 10)
cv = lapply(folds, function(x) {
training_fold = train[-x, ]
test_fold = train[x, ]
# now apply (train) the classifier on the training_fold
classifier = model
y_pred = predict(classifier, newdata = test_fold[-23], n.trees = 5000, type = "response")
y_pred <- colnames(y_pred)[apply(y_pred, 1, which.max)]
cm = table(test_fold[, 23], y_pred)
accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
return(accuracy)
})
cv.average = (cv$Fold01 + cv$Fold02 + cv$Fold03 + cv$Fold04 + cv$Fold05 + cv$Fold06 + cv$Fold07 +
cv$Fold08 + cv$Fold09 + cv$Fold10)/10
return(cv.average)
}
Cross validation results are as follows:
nb.cv <- cv.nb(network.nb)
rf.cv <- cv.rf(network.rf)
boosting.cv <- cv.boosting(network.boosting)
knn.cv <- cv.knn(network.knn)
cv.result <- data.frame(Method= c("Naive Bayes", "Random Forest", "Boosting", "KNN"),
Result = c(nb.cv, rf.cv, boosting.cv, knn.cv))
kable(cv.result, digits = c(0,5))
| Method | Result |
|---|---|
| Naive Bayes | 0.89704 |
| Random Forest | 0.95040 |
| Boosting | 0.97622 |
| KNN | 0.97623 |
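Copying the accuracies from the table above, the selection step reduces to picking the row with the maximum cross-validation accuracy; a minimal base-R sketch (the report ultimately keeps the top two methods, Boosting and KNN, rather than only the single best):

```r
# Cross-validation accuracies copied from the table above.
cv.result <- data.frame(
  Method = c("Naive Bayes", "Random Forest", "Boosting", "KNN"),
  Result = c(0.89704, 0.95040, 0.97622, 0.97623),
  stringsAsFactors = FALSE)
# Method with the highest CV accuracy
cv.result$Method[which.max(cv.result$Result)]
# -> "KNN"
```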
# Calculate predictor evaluations for Naive Bayes
network.nb.score <- predict(network.nb, network[test,])$posterior[,2]
nb.pred <- prediction(network.nb.score, network[test,]$is_intrusion)
nb.perf <- performance(nb.pred, "tpr", "fpr")
nb.auc <- performance(nb.pred, "auc")@y.values[[1]]
# Calculate predictor evaluations for Random Forest
network.rf.score <- predict(network.rf, network[test,], type = "prob")[,2]
rf.pred <- prediction(network.rf.score, network[test,]$is_intrusion)
rf.perf <- performance(rf.pred, "tpr", "fpr")
rf.auc <- performance(rf.pred, "auc")@y.values[[1]]
# Calculate predictor evaluations for Boosting
network.bst.score <- predict(network.boosting, network[test,],
n.trees = 5000, type = "response")[,2,]
bst.pred <- prediction(network.bst.score, network[test,]$is_intrusion)
bst.perf <- performance(bst.pred, "tpr", "fpr")
bst.auc <- performance(bst.pred, "auc")@y.values[[1]]
# Calculate predictor evaluations for KNN
network.knn.score <- predict(network.knn, network.dummy[test,], type = "prob")[,2]
knn.pred <- prediction(network.knn.score, network.dummy[test,]$is_intrusion)
knn.perf <- performance(knn.pred, "tpr", "fpr")
knn.auc <- performance(knn.pred, "auc")@y.values[[1]]
# Create legend of ROC Curve
lgd <- paste(c("Naive Bayes", "Random Forest", "Boosting", "KNN"), " (AUC:",
c(round(nb.auc, 6),
round(rf.auc, 6),
round(bst.auc, 6),
round(knn.auc, 6)),
")", sep = "")
# Plot ROC Curve
plot(nb.perf, col = 1, lty = 1, lwd=1.5, main = "ROC")
plot(rf.perf, col = 2, lty = 2, lwd=1.5, add = TRUE)
plot(bst.perf, col = 3, lty = 3, lwd=1.5, add = TRUE)
plot(knn.perf, col = 4, lty = 4, lwd=1.5, add = TRUE)
legend("bottomright", legend = lgd,
col = c(1,2,3,4), lty = c(1,2,3,4), lwd = 1.5)
The ROC curves of all 4 models are close to the optimal curve through the point (0, 1). Random Forest, Boosting, and KNN clearly outperform Naive Bayes, and Random Forest has the highest AUC among the four.
Now we perform unsupervised learning via K-means clustering to identify possible types of network intrusion. The data used are the 300 intrusion observations, with multi-categorical variables converted to multiple dummy variables.
# Select `is.intrusion` obs
is.intrusion.dummy <- filter(network.dummy, is_intrusion == 1)
is.intrusion <- filter(network, is_intrusion == 1)
# Observe possible clusters with important predictors
pairs(is.intrusion[,match(important.predictor, names(is.intrusion))])
From the pairs plot, we can observe that the data is grouped into approximately 3 clusters.
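The choice of 3 clusters here is visual. A complementary check (not part of the original analysis) is an elbow plot of the total within-cluster sum of squares over a range of K; the sketch below uses toy data, since it is meant only to illustrate the procedure — the real analysis would pass `is.intrusion.dummy[, -45]` in place of `toy`.

```r
# Elbow-plot sketch on toy data with three well-separated groups.
set.seed(3)
toy <- rbind(matrix(rnorm(200, mean = 0),  ncol = 2),
             matrix(rnorm(200, mean = 6),  ncol = 2),
             matrix(rnorm(200, mean = 12), ncol = 2))
# Total within-cluster sum of squares for K = 1..6
wss <- sapply(1:6, function(k)
  kmeans(toy, centers = k, nstart = 20)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow Plot for K-Means")
```

The curve drops steeply up to the true number of groups and flattens afterwards; the "elbow" suggests the K to use.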
# Run K-means clustering
set.seed(3)
km.out<- kmeans(is.intrusion.dummy[, -45], centers = 3, nstart=20)
# Save cluster results
cluster <- as.data.frame(km.out$cluster)
cluster %>% group_by(km.out$cluster) %>%
summarize(count = n())
## # A tibble: 3 x 2
## `km.out$cluster` count
## <int> <int>
## 1 1 210
## 2 2 60
## 3 3 30
# Combine cluster data with original intrusion data
intrusion.re <- cbind(is.intrusion, cluster = factor(km.out$cluster))
intrusion.re.dummy <- cbind(is.intrusion.dummy, cluster = factor(km.out$cluster))
# pair plot for important predictors, colored by clustering results
pairs(intrusion.re[,match(important.predictor, names(intrusion.re))], col = intrusion.re$cluster)
Here we display the pairs plot colored by the 3 clusters we learned. We can see that 3 clusters are reasonable, and several variables are important in differentiating them. The more detailed plots below provide further insights.
# More detailed plotting results
plot(intrusion.re[,5:6], col=(km.out$cluster+1), main="K-Means Clustering Results with K=3",
xlab="src_bytes", ylab="dst_bytes", pch=20, cex = 2)
ggplot(data = intrusion.re, aes(x = hot, y = protocol_type, color = cluster)) +
geom_point(position = "jitter") + theme_bw()
ggplot(data = intrusion.re, aes(x = src_bytes, y = hot, color = cluster)) +
geom_point(position = "jitter") + theme_bw()
ggplot(data = intrusion.re, aes(x = hot, y = is_guest_login, color = cluster)) +
geom_point(position = "jitter") + theme_bw()
ggplot(data = intrusion.re, aes(x = protocol_type, y = flag, color = cluster)) +
geom_point(position = "jitter") + theme_bw()
ggplot(data = intrusion.re, aes(x = service, y = protocol_type, color = cluster)) +
geom_point(position = "jitter") + theme_bw()
ggplot(data = intrusion.re, aes(x = service, y = flag, color = cluster)) +
geom_point(position = "jitter") + theme_bw()
In summary, `src_bytes` is important for distinguishing the 3 intrusion patterns. Intrusion records with `protocol_type` of `udp` tend to be grouped in one cluster, as do guest-login intrusion records and records with a higher number of `hot` indicators. Intrusion records with `service` of `ftp` and `private` are grouped into one cluster. Moreover, `flag` values of `S0` and `RSTR` tend to clearly separate intrusions into 2 different clusters.
Previously, we used K-means to detect 3 clusters labeled 1, 2, and 3. We then label non-intrusion records as cluster 0, so that all records together carry 4 cluster labels.
# No intrusion data is assigned cluster label as 0
no.intrusion.dummy <-filter(network.dummy, is_intrusion != 1) %>%
mutate(cluster = 0)
# Combine data with 4 cluster labels
network.con <- rbind(no.intrusion.dummy, intrusion.re.dummy)
# A look at cluster grouping
network.con %>% group_by(cluster) %>%
summarize(count = n())
## # A tibble: 4 x 2
## cluster count
## <chr> <int>
## 1 0 2699
## 2 1 210
## 3 2 60
## 4 3 30
After labeling all data records, apply KNN classification with predicted cluster results.
# Run KNN and predict
network.knn.cluster <- knn3(as.factor(cluster) ~ ., data = network.con[-test,],
k = which.max(correct.knn))
network.knn.cluster.y.hat <- predict(network.knn.cluster, network.con[test,], type = "class")
# Confusion matrix
network.knn.cluster.confusion <- table(network.knn.cluster.y.hat, network.con[test,]$cluster)
network.knn.cluster.confusion
##
## network.knn.cluster.y.hat 0 1 2 3
## 0 531 0 0 0
## 1 2 44 0 0
## 2 1 0 19 0
## 3 0 0 0 3
network.knn.cluster.misclassification.rate <- mean(network.knn.cluster.y.hat != network.con[test,]$cluster)
network.knn.cluster.misclassification.rate
## [1] 0.005
The pattern prediction appears to be good, with a low error rate of 0.5%.
In Part 2, we did exploratory data analysis on the original dataset. We detected the following relationships between the response variable is_intrusion and the independent variables:
There are three protocol types. Intrusion cases are distributed only in tcp and udp protocol types. No icmp protocol is intrusion.
There are six flag types. ALL RSTR, S0, and S3 flags turn out to be intrusion cases, while NO REJ or RSTO flag case is intrusion.
Intrusion cases fail to log in more often.
On average, the duration of the connection is shorter when it is an intrusion case.
Intrusion cases tend to have fewer hot indicators.
By plotting dst_bytes against src_bytes, we find that intrusion cases have more data bytes from source to destination than from destination to source.
Note: We didn’t control for other variables when doing EDA, thus the findings above only help readers gain a clearer picture on the dataset while no solid conclusions can be drawn.
To clarify, in our report we do not use LDA or PCA for dimension reduction, because our task is to find the rare, abnormal intrusions among all sessions. With dimension reduction, indicative features are very likely to be lost, so we work on all the original features.
In Part 3, we used various data mining methods to address the task problems:
1. Determine if it is possible to differentiate between the labeled intrusions and benign sessions.
Yes. This is a typical classification problem with intrusions labeled as is_intrusion = 1 and benign sessions labeled as is_intrusion = 0.
In Part 3.1 we applied four models to distinguish intrusions from benign sessions: Naive Bayes, Random Forest, Boosting, K-Nearest Neighbours.
After model selection (see more in Q4), we adopted Boosting and K-Nearest Neighbours in our system.
2. Is it possible to identify different types of intrusions? If so, which values of which attributes in data correlate with the specific types of intrusions?
Yes. Though the group division of intrusions is unknown, we can identify different types of intrusions via unsupervised learning.
In Part 3.2 we first applied K-Means clustering to identify different types of intrusions. Intrusions can be classified into 3 types based on various combinations of attributes. At this point, all data can be categorized into 4 groups: benign sessions and 3 types of intrusions. Then we used KNN to train on the data and see how it worked when predicting the four groups. It turned out KNN had a very low misclassification rate of 0.5%, indicating that our K-Means method makes sense.
After trying out different combinations of attributes, we find that src_bytes is important to classify 3 different intrusion patterns (Figure 1). Intrusion records with protocol_type of udp tend to be grouped in one cluster (Figure 2,3,4), as well as guest login intrusion records (Figure 5), and records with higher number of hot indicators (Figure 2). Intrusion records with service of ftp and private are group to one cluster (Figure 4). Moreover, flag of S0 and RSTR tend to clearly group intrusions into 2 different clusters (Figure 6).
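A minimal sketch of this clustering step and the KNN sanity check (3 centers as in the report; the scaling, `nstart`, and k = 5 are illustrative choices):

```r
# Cluster the labeled intrusions into 3 types, then check coherence with KNN.
intrusions <- subset(network, is_intrusion == 1)
intr_x <- fastDummies::dummy_cols(subset(intrusions, select = -is_intrusion),
                                  remove_selected_columns = TRUE)
# scale() breaks on zero-variance columns; drop any constant attributes first.
intr_x <- intr_x[, sapply(intr_x, function(col) var(col) > 0)]
set.seed(1)
km <- kmeans(scale(intr_x), centers = 3, nstart = 25)

# If KNN can reproduce the K-Means labels, the clusters are well separated.
knn_lab <- class::knn(scale(intr_x), scale(intr_x), cl = km$cluster, k = 5)
mean(knn_lab != km$cluster)  # misclassification rate vs. cluster assignment
```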
3. Develop and implement a systematic approach to detect instances of intrusions in log files. Your system will need to be able to take a new network_traffic log file and determine the existence of known patterns of intrusions as well as anomalies which may be indicative of new and unknown intrusion patterns.
We apply a three-step systematic approach to deal with new-coming records:
Step 1: detect whether a new activity is an intrusion.
We applied four classification methods initially and, after model selection, adopted the two top-performing ones: Boosting and KNN (see Q4).
In this setting, missing an intrusion is more costly than mistaking a benign session for one. In other words, we would rather label a suspicious case as an intrusion than let it go.
Therefore, an activity is identified as an intrusion as long as at least one of the two models predicts intrusion.
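Assuming a fitted `gbm_fit` and KNN predictions `knn_pred` for the same hypothetical batch `new_records`, the either-model rule is a logical OR (the 0.5 cutoff is an illustrative default, not a tuned threshold):

```r
# Flag a record as intrusion if either Boosting or KNN says so.
boost_prob <- predict(gbm_fit, newdata = new_records,
                      n.trees = 500, type = "response")
boost_pred <- as.integer(boost_prob > 0.5)
# knn_pred: 0/1 predictions from class::knn on the same records
final_pred <- as.integer(boost_pred == 1 | knn_pred == 1)
```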
Step 2: determine the patterns of intrusions
After identifying all intrusion records, we feed them into the K-Means model, which groups them into the 3 types.
Step 3: examine whether there are anomalies indicating new patterns
In Part 2 (EDA) we found 6 attributes that contain only values of 0. If a new record carries a nonzero value in any of these 6 attributes, it may signal the emergence of a new intrusion pattern.
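A sketch of this check; rather than hard-coding the 6 attribute names, the constant-zero columns are detected from the historical data, and `new_records` is a hypothetical incoming batch:

```r
# Identify attributes that are all-zero historically, then flag new records
# that carry nonzero values in any of them.
zero_cols <- names(network)[sapply(network,
                                   function(x) is.numeric(x) && all(x == 0))]
anomalous <- apply(new_records[, zero_cols, drop = FALSE], 1,
                   function(row) any(row != 0))
new_records[anomalous, ]  # candidates for new intrusion patterns
```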
4. Evaluate detection power of your system.
We used misclassification rate, sensitivity, cross-validation accuracy and ROC curves to examine the detection power of our classification methods.
By all four criteria, the Boosting and KNN methods adopted in our system show good detection power.
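These metrics can be sketched for one model as follows, assuming held-out labels in `test$is_intrusion` and Boosting probabilities `boost_prob`; the ROC curve and AUC use the ROCR package loaded in the head:

```r
pred_class <- as.integer(boost_prob > 0.5)
misclass   <- mean(pred_class != test$is_intrusion)         # misclassification rate
sens       <- mean(pred_class[test$is_intrusion == 1] == 1) # sensitivity

rocr_pred <- ROCR::prediction(boost_prob, test$is_intrusion)
plot(ROCR::performance(rocr_pred, "tpr", "fpr"))            # ROC curve
auc <- ROCR::performance(rocr_pred, "auc")@y.values[[1]]    # area under ROC
```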
5. Can your intrusion detector be used in real time? It would need to receive data about a current session and, within seconds, determine whether it is likely an intrusion of a previously seen type or an anomaly potentially signifying a yet-unseen intrusion mode. What information should be exchanged via the user interface of such a system?
Yes, it can detect and classify intrusions in real time. The workflow of our system is as follows:
Once a new record arrives, it is fed to the Boosting and KNN models; if either model predicts is_intrusion = 1, the record is labeled as an intrusion.
A record labeled as an intrusion is then fed to the K-Means model to determine its intrusion type.
For every 1000 new records that flow in, the system updates itself by re-splitting the training and test data and generating new models.
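The loop above could be organized as a single driver function; `predict_intrusion`, `assign_cluster`, and `retrain` are hypothetical helpers wrapping the models described earlier, not functions defined in this report:

```r
# Hypothetical real-time driver: score, type, and periodically retrain.
process_record <- function(rec, models, buffer) {
  if (predict_intrusion(rec, models)) {          # Boosting-or-KNN rule
    rec$type <- assign_cluster(rec, models$km)   # K-Means intrusion type
  }
  buffer <- rbind(buffer, rec)
  if (nrow(buffer) %% 1000 == 0) {               # refresh every 1000 records
    models <- retrain(buffer)                    # re-split and refit
  }
  list(models = models, buffer = buffer)
}
```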